Reliable Measures for Aligning Japanese-English News Articles and Sentences
Abstract
We have aligned Japanese and English news articles and sentences to build a large parallel corpus. We first used a method based on cross-language information retrieval (CLIR) to align the Japanese and English articles, and then used a method based on dynamic programming (DP) matching to align the Japanese and English sentences within those articles. However, the results included many incorrect alignments. To remove these, we propose two measures (scores) that evaluate the validity of alignments. The measure for article alignment uses the similarities of sentences aligned by DP matching, and the measure for sentence alignment uses the similarities of articles aligned by CLIR. The two measures reinforce each other and improve alignment accuracy. Using these measures, we have constructed a large-scale corpus of aligned articles and sentences that is available to the public.
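The DP matching step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `jaccard` and `align_sentences`, the `skip_penalty` parameter, and the word-overlap similarity are all assumptions made for illustration, and the toy similarity presumes that both sides have already been mapped to a common vocabulary (for example through a bilingual dictionary). The sketch allows only one-to-one matches and skipped sentences, and keeps the order of sentences monotonic.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Toy similarity (an assumption): word overlap between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def align_sentences(
    src: List[str],
    tgt: List[str],
    sim: Callable[[str, str], float] = jaccard,
    skip_penalty: float = 0.1,  # hypothetical cost of leaving a sentence unaligned
) -> List[Tuple[int, int, float]]:
    """Monotonic DP alignment with 1-1 matches and skips.

    Returns (src_index, tgt_index, similarity) for each matched pair.
    """
    n, m = len(src), len(tgt)
    # best[i][j]: best total score for aligning src[:i] with tgt[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                score = best[i - 1][j - 1] + sim(src[i - 1], tgt[j - 1])
                candidates.append((score, "match"))
            if i > 0:
                candidates.append((best[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                candidates.append((best[i][j - 1] - skip_penalty, "skip_tgt"))
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the final cell to recover the matched pairs.
    pairs: List[Tuple[int, int, float]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1, sim(src[i - 1], tgt[j - 1])))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    source = ["the cat sat", "stocks fell sharply", "rain is expected"]
    target = ["the cat sat", "rain is expected"]
    for s, t, score in align_sentences(source, target):
        print(s, t, round(score, 2))
```

The per-pair similarities returned here are the kind of signal that a validity score could be built on, for instance by discarding pairs below a threshold; the exact measures proposed in the paper are not reproduced in this sketch.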
Similar resources
Machine Translation of Sentences with Fixed Expressions
This paper presents a practical machine translation system based on sentence types for economic news stories. Conventional English-to-Japanese machine translation (MT) systems, which are rule-based, have difficulty translating certain types of Associated Press (AP) wire service news stories, such as economics and sports, because these topics include many fixed expressions (such as comp...
Detection of Difference between News Articles on the Same Topic Based on Sequential Comparison
Currently, a lot of news articles are published on the Web, and it is getting easier for us to read them. However, the number of articles is too large for us to read all of them. Although some Web sites cluster or classify news articles into topics (categories), this is not enough, since a large number of articles still remain in each topic. Detecting difference between articles on one topic will b...
Translation of News Headlines
Machine translation of news headlines is difficult because the sentences are fragmentary and abbreviations and acronyms of proper names are frequently used. Another difficulty is that, since the headline comes at the top of a news article, the context information useful for disambiguating the sense of words and determining their translation (target word) is not available. This paper proposes a new a...
Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria
We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written ...
Automatic Alignment of Japanese and English Newspaper Articles using an MT System and a Bilingual Company Name Dictionary
One of the crucial parts of any corpus-based machine translation system is a large-scale bilingual corpus that is aligned at various levels, such as the sentence and phrase levels. This kind of corpus, however, is not easy to obtain, and accordingly, there is a great need for an efficient construction method. We approach this problem by integrating two large monolingual corpora in two different...